fix deadlock in processWriteQueue #778
Conversation
LGTM
the way to update a library is now like so:
looks like there are definitely some useful bugfixes in the gocql library we should have.
tracing.Failure(span)
tracing.Error(span, err)
errmetrics.Inc(err)
return nil, err
for better or worse, it used to be by design that we would continue here and just use as much data as we could get back. remember the days before @replay started, and we would see charts with missing data here and there because certain chunks timed out? we would see the timeout errors on our MT dashboard, and MT always tried to make the most out of it.
anyway, the importance of "correct or error" (as opposed to "best effort") is becoming more and more clear, as we see with some customers and their alerting queries. so 👍
@@ -290,6 +291,7 @@ func NewCassandraStore(addrs, keyspace, consistency, CaPath, Username, Password,
	omitReadTimeout: time.Duration(omitReadTimeout) * time.Second,
	ttlTables: ttlTables,
	tracer: opentracing.NoopTracer{},
	timeout: cluster.Timeout,
note that we default to 1000 ms, which seems too aggressive. looking at some customer dashboards, some customers are seeing multi-second query durations (cassGetExecDuration)
this timeout is only used for writes. reads just use the existing requestContext, which we don't currently set a timeout on and instead just rely on the context being canceled, e.g. when the client disconnects.
queries will fail if they run longer than cluster.Timeout.
metrictank/vendor/github.com/gocql/gocql/conn.go
Lines 612 to 658 in ca4a891
var timeoutCh <-chan time.Time
if c.timeout > 0 {
	if call.timer == nil {
		call.timer = time.NewTimer(0)
		<-call.timer.C
	} else {
		if !call.timer.Stop() {
			select {
			case <-call.timer.C:
			default:
			}
		}
	}
	call.timer.Reset(c.timeout)
	timeoutCh = call.timer.C
}

var ctxDone <-chan struct{}
if ctx != nil {
	ctxDone = ctx.Done()
}

select {
case err := <-call.resp:
	close(call.timeout)
	if err != nil {
		if !c.Closed() {
			// if the connection is closed then we cant release the stream,
			// this is because the request is still outstanding and we have
			// been handed another error from another stream which caused the
			// connection to close.
			c.releaseStream(stream)
		}
		return nil, err
	}
case <-timeoutCh:
	close(call.timeout)
	c.handleTimeout()
	return nil, ErrTimeoutNoResponse
case <-ctxDone:
	close(call.timeout)
	return nil, ctx.Err()
case <-c.quit:
	return nil, ErrConnectionClosed
}
So if we have queries that are running longer than this, then it will be because they are blocked waiting for a stream to become available.
ah yes..
in insertChunk we call context.WithTimeout(context.Background(), c.timeout)
doesn't that mean that in the select either of the 2 operations (between lines 648 and 655 in the code shown above) can succeed at practically the same time? one comes from the timeout set from c.timeout, which is the same timeout value we specified in the config, whereas ctxDone fires when the context's deadline triggers, which is a deadline we also set based on the same timeout?
IOW i don't see the use for the context on the insert queries since they seem to achieve the same thing as the already set up timeout mechanism. am i missing something?
c.timeout is only measured from after the request is sent to cassandra.
the context.Deadline is set when the request is constructed, so includes the time the request sits waiting to be sent.
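to make that concrete, here's a small self-contained toy sketch (none of this is the actual gocql or metrictank code; doWrite and the simulated delays are made up purely for illustration). the context deadline starts ticking when the request is constructed, so it also covers the time spent queued, while the per-call timer only starts once the request is handed to the connection:

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errTimeoutNoResponse = errors.New("no response received before timeout")

// doWrite simulates one write: the context deadline covers queueing plus the
// round trip, while the per-call timer (like gocql's c.timeout) only covers
// the round trip itself.
func doWrite(timeout, queueDelay, respDelay time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	// simulate the request sitting in the write queue waiting for a stream
	select {
	case <-time.After(queueDelay):
	case <-ctx.Done():
		return ctx.Err() // deadline exceeded before the request was even sent
	}

	// only now does the per-call timer start
	callTimer := time.NewTimer(timeout)
	defer callTimer.Stop()

	select {
	case <-time.After(respDelay): // stand-in for the response arriving
		return nil
	case <-callTimer.C:
		return errTimeoutNoResponse
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// queued for 800ms, response after 500ms, timeout 1s: the per-call timer
	// would be satisfied, but the context deadline (which includes the 800ms
	// of queueing) fires first and we get context.DeadlineExceeded.
	fmt.Println(doWrite(time.Second, 800*time.Millisecond, 500*time.Millisecond))
}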
let's rename this field to writeTimeout to avoid confusion, and a note explaining what you said in your above comment would also be a big help to anyone reading the code.
let's rename this field to writeTimeout to avoid confusion
No. CassandraStore.timeout is assigned the same value as the gocql cluster.Timeout, which is for reads and writes.
Variables should be named based on the information they contain, not on how they are used.
if err == context.Canceled || err == context.DeadlineExceeded {
	// query was aborted.
	return nil, nil
}
tracing.Failure(span)
tracing.Error(span, err)
errmetrics.Inc(err)
this line triggers this code:
func (m *ErrMetrics) Inc(err error) {
	if err == gocql.ErrTimeoutNoResponse {
		m.cassErrTimeout.Inc()
	} else if err == gocql.ErrTooManyTimeouts {
		m.cassErrTooManyTimeouts.Inc()
	} else if err == gocql.ErrConnectionClosed {
		m.cassErrConnClosed.Inc()
	} else if err == gocql.ErrNoConnections {
		m.cassErrNoConns.Inc()
	} else if err == gocql.ErrUnavailable {
		m.cassErrUnavailable.Inc()
	} else if strings.HasPrefix(err.Error(), "Cannot achieve consistency level") {
		m.cassErrCannotAchieveConsistency.Inc()
	} else {
		m.cassErrOther.Inc()
	}
}
seems weird that various gocql timeouts result in err metrics being incremented, whereas context.DeadlineExceeded does not. I think we should also increment a timeout metric when context.DeadlineExceeded triggers
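for illustration, a sketch of what that could look like; the cassErrCtxDeadlineExceeded counter is hypothetical and does not exist in ErrMetrics today, the rest mirrors the code above. note it would only matter if the early return on context errors shown in the diff is also changed so that Inc actually gets called for those errors:

func (m *ErrMetrics) Inc(err error) {
	// hypothetical: count caller deadlines separately from gocql's own timeouts
	if err == context.DeadlineExceeded {
		m.cassErrCtxDeadlineExceeded.Inc() // this counter would have to be added to ErrMetrics
		return
	}
	if err == gocql.ErrTimeoutNoResponse {
		m.cassErrTimeout.Inc()
	} else if err == gocql.ErrTooManyTimeouts {
		m.cassErrTooManyTimeouts.Inc()
	} else if err == gocql.ErrConnectionClosed {
		m.cassErrConnClosed.Inc()
	} else if err == gocql.ErrNoConnections {
		m.cassErrNoConns.Inc()
	} else if err == gocql.ErrUnavailable {
		m.cassErrUnavailable.Inc()
	} else if strings.HasPrefix(err.Error(), "Cannot achieve consistency level") {
		m.cassErrCannotAchieveConsistency.Inc()
	} else {
		m.cassErrOther.Inc()
	}
}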
store_cassandra.go does not set a context Timeout on the read path.
So if context.Canceled or context.DeadlineExceeded is returned, it is because the caller wanted us to give up on the request. There was no issue preventing us from executing the request.
currently, we don't set a deadline on the request context anywhere, so context.DeadlineExceeded will never be returned. It is just here for completeness in case we one day add a deadline to the context.
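As a toy illustration of that distinction (none of this is metrictank code; search below just simulates a slow read): a caller-supplied deadline surfaces as context.DeadlineExceeded, which the read path treats as "the caller gave up" rather than as a cassandra error.

package main

import (
	"context"
	"fmt"
	"time"
)

// search stands in for the read path: it only returns an error if the
// caller's context is canceled or its deadline passes.
func search(ctx context.Context) error {
	select {
	case <-time.After(2 * time.Second): // pretend the cassandra query takes 2s
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

func main() {
	// hypothetical: if we one day put a deadline on the request context,
	// the read path would see context.DeadlineExceeded and treat it as the
	// caller aborting (return nil, nil), not as a cassandra error metric.
	ctx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()
	fmt.Println(search(ctx)) // prints: context deadline exceeded
}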
wouldn't we want context.DeadlineExceeded to mark the spans as failed and the ErrMetrics to be incremented? it seems to me any such deadline being hit is a failure that should be reported
No. If DeadlineExceeded is reached it is because the caller set a timeout and they should emit their own error metrics and set the span to failed.
The error would have nothing to do with cassandra, so you should not increment the cassandraErr counters
fair enough
You need to leave a comment indicating the requested changes.
I did, github :(
just some minor changes needed, see comments.
also don't forget #778 (comment) thanks
Also return an error when a query fails rather than just silently ignoring it.
This is good to merge @Dieterbe
also:
* newer version of dep uses multi-line format
* it auto-added a bunch of constraints
* needed to pin gocql. it was tricky to determine which version our gocql is supposed to be at. the last update was #778 but we don't know what version of gocql that was exactly. None of the versions in gocql's last year of git history matches what we have in our vendor dir; in fact, the smallest diff with any version was still about 480 lines, so it looks like not all go files were copied over. however, it seems likely it would have been d9815cdf0ff24e2efa9b8062f4f94a6dd347ae51 because our vendor dir does include that change but not some of the later changes, and the time works out with that PR.
fixes #777